Introduction

In this EDA, I explore a dataset of airline on-time performance to try and find insights to flight delay and cancellations. The dataset used is a very large dataset that consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. There are over 120 million observations (flights) in this dataset for flights. The data was compressed into individual CSV files for each year.

I chose to explore this particular dataset because it would allow me learn new skills and optimization techniques for handling large datasets.

Due to the size of this dataset, it would very difficult to load the data into a Pandas dataframe in memory without reducing it to a very small subset of the data, so I decided to employ the use of R markdown (instead of Jupyter notebook) so that I can use R packages, along with SQL queries, to wrangle the data into a more summarized format that a Pandas dataframe can handle.

Preliminary Wrangling

Importing R Libraries

library(tidyverse)
library(skimr)
library(dplyr)
library(here) # To locate files based on current working directory
library(janitor) # Tools for for examining and cleaning dirty data.
library(reticulate) # For reading R objects in Python
library(data.table) # For reading large datasets efficiently
library(inborutils) # For reading CSV files and converting to SQL
library(DBI) # Interface to connect with SQL databases
library(RSQLite) # For connecting with SQL databases

Importing Python Packages

import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pathlib
import os

Loading In The Data

This dataset contains 21 large CSV files of flight data for each year from 1987 to 2008, as well as some other CSV files for which contain extra information. I will not directly read any of the large CSV files because it would take too much memory. I will read in the other CSV files, then use R libraries and SQL queries to read in smaller samples of the data to explore.

path_column_data = "airline/airline_dataset_column_info.csv"
path_airport_data = "airline/other_data/airports.csv"
path_carrier_data = "airline/other_data/carriers.csv"
path_plane_data = "airline/other_data/plane-data.csv"
path_main = "airline/main_data" # Path to the main (yearly) CSV files

def absolute_file_paths(directory):
    data = {}
    files = os.listdir(directory)
    paths = [f"{path_main}/{file}" for file in files]
    
    for idx in range(len(files)):
        name = files[idx].split('.')[0]
        data[name] = paths[idx]
    return data

main_files = absolute_file_paths(path_main)
# Data on column descriptions for the main files
column_data = pd.read_csv(path_column_data)

# Data on different airports
airport_data = pd.read_csv(path_airport_data)

# Airline companies
carrier_data = pd.read_csv(path_carrier_data)

# Plane data, specifications and other info
plane_data = pd.read_csv(path_plane_data)
# Info on column descriptions for the main files
column_data <- read.csv(py$path_column_data)

# Data on different airports
airport_data <- read.csv(py$path_airport_data)

# Information on airline companies
carrier_data <- read.csv(py$path_carrier_data)

# Plane data, specifications and other info
plane_data <- read.csv(py$path_plane_data)

For the main data, I have written a script that reads in the data in smaller chunks and stores them in a database file (sqlite). Each year’s data is stored in its own table. I also store the other data in their own tables so that later, when needed, I can reference them using SQL joins. The process takes a while to run because of the large dataset (over 30 minutes on my PC).

Also, the original files are named by the year they represent. It is not be good practice to name a database table starting with a number, so the script adds a prefix to each name.

For now, I have added a condition so the code will only be executed if the sqlite file is not detected in the project root directory.

path_main <- "airline/main_data"
db_file <- "airline_data.sqlite"

save_in_sql <- function() {
  main_files <- list.files(path = path_main, full.names = TRUE)
  
  if (!file.exists(db_file)) {
    # Creating the airport data table
    inborutils::csv_to_sqlite(
                csv_file = py$path_airport_data,
                table_name = "airports",
                sqlite_file = db_file,
                show_progress_bar = FALSE)
    
    # Creating the carrier data table
    inborutils::csv_to_sqlite(
                csv_file = py$path_carrier_data,
                table_name = "carriers",
                sqlite_file = db_file,
                show_progress_bar = FALSE)
    
    # Creating the plane data table
    inborutils::csv_to_sqlite(
                csv_file = py$path_plane_data,
                table_name = "planes",
                sqlite_file = db_file,
                show_progress_bar = FALSE)
    
    # Creating the tables for each of the years' data
    for (csv in main_files) {
      csv_name <- strsplit(csv, "/|[.]")[[1]] # Splitting the csv name by "/" or "."
      csv_name <- csv_name[length(csv_name)-1] # Getting the second last element of the list
      table_name <- paste("table", csv_name, sep="_")
    
      print("Updating table: %s", table_name)
      inborutils::csv_to_sqlite(
                  csv_file = csv,
                  sqlite_file = db_file, 
                  table_name = table_name, 
                  pre_process_size = 1000,
                  chunk_size = 50000, 
                  show_progress_bar = TRUE)
    }
    
  }
}

save_in_sql()

Now I inspect the database file to be sure that all tables have been added and updated properly

airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db

db_tables <- dbListTables(airline_db) # List out the tables in the db
print(db_tables)
##  [1] "airports"     "carriers"     "planes"       "sqlite_stat1" "sqlite_stat4"
##  [6] "table_1987"   "table_1988"   "table_1989"   "table_1990"   "table_1991"  
## [11] "table_1992"   "table_1993"   "table_1994"   "table_1995"   "table_1996"  
## [16] "table_1997"   "table_1998"   "table_1999"   "table_2000"   "table_2001"  
## [21] "table_2002"   "table_2003"   "table_2004"   "table_2005"   "table_2006"  
## [26] "table_2007"   "table_2008"
db_1993_cols <- dbListFields(airline_db, "table_1993") # Column names for specific table in db
print(length(db_1993_cols))
## [1] 29

Structure of the dataset

We can see from the above result that there are 29 columns in the table and this is the same across all the tables (the yearly tables), they all have the same columns, but we don’t know exactly how many rows are in each table.

The code below is a script/query to return exactly the number of rows (observations) that are in each table. The query can take a few minutes to execute the first time.

# count_rows <- function() {
#     Table = character() # Empty vector/list to store table names
#     Row_Count = integer() # Empty vector/list to store row counts
#   
#     for (table in db_tables) {
#         query_rows <- sprintf("SELECT COUNT(*) AS Rows FROM %s", table)
#         row_count <- dbGetQuery(airline_db, query_rows)[[1]]
#     
#         Table <- c(Table, table) # Appending each table name to the vector
#         Row_Count <- c(Row_Count, row_count) # Appending each row count to the vector
#     }
#   
#     df_row_count <- data.frame(Table, Row_Count)
#     return(df_row_count)
# }
# table_row_count <- count_rows()
# table_row_count

We can now see the number of rows in each table, which sums up to over 120 million observations. To test the SQL connection, I load in the first 500 rows of data from a particular year (2005 dataset in this case) using SQL and the R interface

query_test <- "SELECT * FROM table_2005 LIMIT 500"

tbl(airline_db, sql(query_test)) # Runs the query and displays results without loading it in memory
## # Source:   SQL [?? x 29]
## # Database: sqlite 3.39.2 [O:\GitHub\data_analysis\alx\airline_performance_analysis\airline_data.sqlite]
##     Year Month DayofMo…¹ DayOf…² DepTime CRSDe…³ ArrTime CRSAr…⁴ Uniqu…⁵ Fligh…⁶
##    <dbl> <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>     <dbl>
##  1  2005     1        28       5    1603    1605    1741    1759 UA          541
##  2  2005     1        29       6    1559    1605    1736    1759 UA          541
##  3  2005     1        30       7    1603    1610    1741    1805 UA          541
##  4  2005     1        31       1    1556    1605    1726    1759 UA          541
##  5  2005     1         2       7    1934    1900    2235    2232 UA          542
##  6  2005     1         3       1    2042    1900       9    2232 UA          542
##  7  2005     1         4       2    2046    1900    2357    2232 UA          542
##  8  2005     1         5       3      NA    1900      NA    2232 UA          542
##  9  2005     1         6       4    2110    1900       8    2223 UA          542
## 10  2005     1         7       5    1859    1900    2235    2223 UA          542
## # … with more rows, 19 more variables: TailNum <chr>, ActualElapsedTime <dbl>,
## #   CRSElapsedTime <dbl>, AirTime <dbl>, ArrDelay <dbl>, DepDelay <dbl>,
## #   Origin <chr>, Dest <chr>, Distance <dbl>, TaxiIn <dbl>, TaxiOut <dbl>,
## #   Cancelled <dbl>, CancellationCode <chr>, Diverted <dbl>,
## #   CarrierDelay <dbl>, WeatherDelay <dbl>, NASDelay <dbl>,
## #   SecurityDelay <dbl>, LateAircraftDelay <dbl>, and abbreviated variable
## #   names ¹​DayofMonth, ²​DayOfWeek, ³​CRSDepTime, ⁴​CRSArrTime, ⁵​UniqueCarrier, …
top_rows <- dbGetQuery(airline_db, query_test) # Runs the query and stores it in a dataframe, in memory♂

dbDisconnect(airline_db) # Disconnect from the database when done

Now lets look at some summary statistics for the data. I will read in the first 5,000 rows of the 2007 flight data.

top_2007 = pd.read_csv(main_files['2007'], nrows=5000)
top_2007.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 5000 entries, 0 to 4999
## Data columns (total 29 columns):
##  #   Column             Non-Null Count  Dtype  
## ---  ------             --------------  -----  
##  0   Year               5000 non-null   int64  
##  1   Month              5000 non-null   int64  
##  2   DayofMonth         5000 non-null   int64  
##  3   DayOfWeek          5000 non-null   int64  
##  4   DepTime            4964 non-null   float64
##  5   CRSDepTime         5000 non-null   int64  
##  6   ArrTime            4963 non-null   float64
##  7   CRSArrTime         5000 non-null   int64  
##  8   UniqueCarrier      5000 non-null   object 
##  9   FlightNum          5000 non-null   int64  
##  10  TailNum            5000 non-null   object 
##  11  ActualElapsedTime  4963 non-null   float64
##  12  CRSElapsedTime     5000 non-null   int64  
##  13  AirTime            4963 non-null   float64
##  14  ArrDelay           4963 non-null   float64
##  15  DepDelay           4964 non-null   float64
##  16  Origin             5000 non-null   object 
##  17  Dest               5000 non-null   object 
##  18  Distance           5000 non-null   int64  
##  19  TaxiIn             5000 non-null   int64  
##  20  TaxiOut            5000 non-null   int64  
##  21  Cancelled          5000 non-null   int64  
##  22  CancellationCode   36 non-null     object 
##  23  Diverted           5000 non-null   int64  
##  24  CarrierDelay       5000 non-null   int64  
##  25  WeatherDelay       5000 non-null   int64  
##  26  NASDelay           5000 non-null   int64  
##  27  SecurityDelay      5000 non-null   int64  
##  28  LateAircraftDelay  5000 non-null   int64  
## dtypes: float64(6), int64(18), object(5)
## memory usage: 1.1+ MB

Most of the columns are numeric, some indicating arrival and departure, as well as different causes of delays. There are some binary columns such as “Cancelled” and “Diverted” which are important variables to analyze.

Features of interest

For this EDA, I am interested in exploring some of the ideas suggested on the source website which are:

  1. When is the best time of day/day of week/time of year to fly to minimize delays?
  2. Do older planes suffer more delays?
  3. How well does weather predict plane delays?

Generally, I am interested in exploring the cause of flight delays and cancellations.

Areas To Focus On

For this investigation, I will get the best insights by focusing on the “delay” columns. By analyzing the delays on each day of the week and each month, I believe I can get a good idea of the best times to fly. I will explore data for a single year. Then later on, I will compare the data across the other years to see if there are similar patterns across the years.

Univariate Exploration

I will start by analyzing the departure delays: DepDelay. I will be using the 2007 dataset. Because the data is so large, I will only read in some select columns into the dataframe.

df_2007 = pd.read_csv(main_files['2007'], usecols = ['Month', 'DayofMonth', 'ArrDelay', 'DepDelay', 'Distance'], nrows = 2_000_000)
df_2007.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2000000 entries, 0 to 1999999
## Data columns (total 5 columns):
##  #   Column      Dtype  
## ---  ------      -----  
##  0   Month       int64  
##  1   DayofMonth  int64  
##  2   ArrDelay    float64
##  3   DepDelay    float64
##  4   Distance    int64  
## dtypes: float64(2), int64(3)
## memory usage: 76.3 MB
df_2007.describe()

# Check for missing values
##               Month    DayofMonth      ArrDelay      DepDelay      Distance
## count  2.000000e+06  2.000000e+06  1.935147e+06  1.939160e+06  2.000000e+06
## mean   2.182453e+00  1.555414e+01  1.050200e+01  1.184517e+01  7.000558e+02
## std    9.717774e-01  8.695236e+00  3.906322e+01  3.582602e+01  5.458290e+02
## min    1.000000e+00  1.000000e+00 -3.120000e+02 -3.050000e+02  1.100000e+01
## 25%    1.000000e+00  8.000000e+00 -9.000000e+00 -4.000000e+00  3.080000e+02
## 50%    2.000000e+00  1.600000e+01  0.000000e+00  0.000000e+00  5.460000e+02
## 75%    3.000000e+00  2.300000e+01  1.500000e+01  1.200000e+01  9.300000e+02
## max    4.000000e+00  3.100000e+01  1.564000e+03  1.547000e+03  4.962000e+03
df_2007['DepDelay'].isnull().sum()
## 60840
df_2007[df_2007.ArrDelay.notnull()]
##          Month  DayofMonth  ArrDelay  DepDelay  Distance
## 0            1           1       1.0       7.0       389
## 1            1           1       8.0      13.0       479
## 2            1           1      34.0      36.0       479
## 3            1           1      26.0      30.0       479
## 4            1           1      -3.0       1.0       479
## ...        ...         ...       ...       ...       ...
## 1999995      4          30      22.0       0.0       741
## 1999996      4          30      48.0      50.0      1024
## 1999997      4          30     -12.0      -1.0       646
## 1999998      4          30      12.0      20.0       589
## 1999999      4          30     -21.0      -5.0       461
## 
## [1935147 rows x 5 columns]
# Changing column data types to reduce memory usage
df_2007 = df_2007.astype({'Month':'int8', 'DayofMonth':'int8', 'ArrDelay':'float32', 'DepDelay':'float32'})
binsize = 30
bins = np.arange(0, df_2007['DepDelay'].max()+binsize, binsize)

plt.figure(figsize=[14, 8])
plt.hist(data = df_2007, x = 'DepDelay', bins = bins)
## (array([7.76003e+05, 1.39764e+05, 5.98980e+04, 3.07130e+04, 1.66560e+04,
##        9.33500e+03, 5.72500e+03, 3.33200e+03, 1.94700e+03, 1.26600e+03,
##        7.70000e+02, 5.22000e+02, 3.16000e+02, 2.22000e+02, 1.61000e+02,
##        1.11000e+02, 1.02000e+02, 7.30000e+01, 5.90000e+01, 3.80000e+01,
##        3.70000e+01, 3.60000e+01, 2.90000e+01, 3.10000e+01, 1.60000e+01,
##        2.20000e+01, 1.10000e+01, 1.60000e+01, 1.30000e+01, 1.50000e+01,
##        1.80000e+01, 1.20000e+01, 7.00000e+00, 7.00000e+00, 1.50000e+01,
##        7.00000e+00, 1.10000e+01, 4.00000e+00, 3.00000e+00, 8.00000e+00,
##        1.00000e+00, 2.00000e+00, 1.00000e+00, 2.00000e+00, 4.00000e+00,
##        0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
##        0.00000e+00, 1.00000e+00]), array([   0.,   30.,   60.,   90.,  120.,  150.,  180.,  210.,  240.,
##         270.,  300.,  330.,  360.,  390.,  420.,  450.,  480.,  510.,
##         540.,  570.,  600.,  630.,  660.,  690.,  720.,  750.,  780.,
##         810.,  840.,  870.,  900.,  930.,  960.,  990., 1020., 1050.,
##        1080., 1110., 1140., 1170., 1200., 1230., 1260., 1290., 1320.,
##        1350., 1380., 1410., 1440., 1470., 1500., 1530., 1560.]), <BarContainer object of 52 artists>)
plt.xlabel('Departure Delays (mins)')
plt.show()

The distribution is skewed to the left and there is a short tail. A large majority of the data falls within the range of 0 and 250 minutes. I would have gone for a logarithmic scale but this data has negative values (because there are flights that took off before the expected departure time). There are also some missing values.

# Flights that took off over 25 minutes earlier
df_2007[df_2007.DepDelay < -25].info()

# Lowest departure delay
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 165 entries, 99831 to 1977687
## Data columns (total 5 columns):
##  #   Column      Non-Null Count  Dtype  
## ---  ------      --------------  -----  
##  0   Month       165 non-null    int8   
##  1   DayofMonth  165 non-null    int8   
##  2   ArrDelay    163 non-null    float32
##  3   DepDelay    165 non-null    float32
##  4   Distance    165 non-null    int64  
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 4.2 KB
df_2007[df_2007.DepDelay < 0].DepDelay.min()
## -305.0

We can see that there are many flights that took off before the expected departure time (almost 50% of all the flights in that year). That is not unusual, especially if it falls within a few minutes and all passengers are available, but there are many flights that took off unusually early (over 30 minutes early, even up to 5 hours early). There can be many reasons for this but for now since this exploration is mainly focused on delay times and there are so many records to work with, I will only assess flights that were actually delayed,

df_2007_delayed = df_2007[df_2007.DepDelay > 0]
df_2007_delayed.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 867613 entries, 0 to 1999998
## Data columns (total 5 columns):
##  #   Column      Non-Null Count   Dtype  
## ---  ------      --------------   -----  
##  0   Month       867613 non-null  int8   
##  1   DayofMonth  867613 non-null  int8   
##  2   ArrDelay    865260 non-null  float32
##  3   DepDelay    867613 non-null  float32
##  4   Distance    867613 non-null  int64  
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 21.5 MB
df_2007_delayed.describe()
##                Month     DayofMonth  ...       DepDelay       Distance
## count  867613.000000  867613.000000  ...  867613.000000  867613.000000
## mean        2.199465      15.268106  ...      31.058889     732.023085
## std         0.961845       8.696432  ...      46.776241     554.655358
## min         1.000000       1.000000  ...       1.000000      11.000000
## 25%         1.000000       8.000000  ...       5.000000     328.000000
## 50%         2.000000      15.000000  ...      15.000000     587.000000
## 75%         3.000000      22.000000  ...      38.000000     967.000000
## max         4.000000      31.000000  ...    1547.000000    4962.000000
## 
## [8 rows x 5 columns]

Now I will try to plot using a log scale

log_binsize = 0.025
bins = 10 ** np.arange(0, np.log10(df_2007_delayed['ArrDelay'].max())+log_binsize, log_binsize)

plt.figure(figsize=[14, 8])
plt.hist(data = df_2007_delayed, x = 'ArrDelay', bins = bins)
## (array([1.8399e+04, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
##        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
##        0.0000e+00, 0.0000e+00, 1.8498e+04, 0.0000e+00, 0.0000e+00,
##        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.8307e+04,
##        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.8316e+04,
##        0.0000e+00, 0.0000e+00, 1.8036e+04, 0.0000e+00, 0.0000e+00,
##        0.0000e+00, 1.7587e+04, 0.0000e+00, 1.7161e+04, 0.0000e+00,
##        0.0000e+00, 1.6417e+04, 0.0000e+00, 1.5996e+04, 0.0000e+00,
##        1.5876e+04, 1.4837e+04, 0.0000e+00, 1.4764e+04, 1.4191e+04,
##        1.3548e+04, 0.0000e+00, 1.3196e+04, 1.2495e+04, 1.1998e+04,
##        1.1600e+04, 1.1132e+04, 2.1177e+04, 1.0075e+04, 9.4680e+03,
##        1.8348e+04, 8.5540e+03, 1.6208e+04, 7.6150e+03, 1.4665e+04,
##        1.3475e+04, 1.2803e+04, 1.1844e+04, 1.0983e+04, 1.5359e+04,
##        9.1860e+03, 1.3035e+04, 1.2071e+04, 1.1038e+04, 1.0027e+04,
##        9.0590e+03, 1.1371e+04, 7.6860e+03, 9.5340e+03, 8.6130e+03,
##        9.6740e+03, 8.6730e+03, 7.5970e+03, 6.9870e+03, 6.3160e+03,
##        6.6660e+03, 6.9340e+03, 5.1820e+03, 5.3670e+03, 5.2750e+03,
##        4.5440e+03, 3.8570e+03, 3.7300e+03, 3.1440e+03, 3.0050e+03,
##        2.7560e+03, 2.3030e+03, 2.0610e+03, 1.6000e+03, 1.6490e+03,
##        1.1400e+03, 1.0150e+03, 8.1500e+02, 7.6500e+02, 5.4800e+02,
##        4.7900e+02, 3.6400e+02, 2.7900e+02, 2.6300e+02, 1.6300e+02,
##        1.2500e+02, 1.0400e+02, 1.2000e+02, 8.0000e+01, 6.3000e+01,
##        4.9000e+01, 5.0000e+01, 3.9000e+01, 4.7000e+01, 3.5000e+01,
##        2.1000e+01, 2.2000e+01, 2.6000e+01, 2.5000e+01, 1.5000e+01,
##        2.0000e+01, 2.2000e+01, 8.0000e+00, 9.0000e+00, 4.0000e+00,
##        3.0000e+00, 1.0000e+00, 1.0000e+00]), array([1.00000000e+00, 1.05925373e+00, 1.12201845e+00, 1.18850223e+00,
##        1.25892541e+00, 1.33352143e+00, 1.41253754e+00, 1.49623566e+00,
##        1.58489319e+00, 1.67880402e+00, 1.77827941e+00, 1.88364909e+00,
##        1.99526231e+00, 2.11348904e+00, 2.23872114e+00, 2.37137371e+00,
##        2.51188643e+00, 2.66072506e+00, 2.81838293e+00, 2.98538262e+00,
##        3.16227766e+00, 3.34965439e+00, 3.54813389e+00, 3.75837404e+00,
##        3.98107171e+00, 4.21696503e+00, 4.46683592e+00, 4.73151259e+00,
##        5.01187234e+00, 5.30884444e+00, 5.62341325e+00, 5.95662144e+00,
##        6.30957344e+00, 6.68343918e+00, 7.07945784e+00, 7.49894209e+00,
##        7.94328235e+00, 8.41395142e+00, 8.91250938e+00, 9.44060876e+00,
##        1.00000000e+01, 1.05925373e+01, 1.12201845e+01, 1.18850223e+01,
##        1.25892541e+01, 1.33352143e+01, 1.41253754e+01, 1.49623566e+01,
##        1.58489319e+01, 1.67880402e+01, 1.77827941e+01, 1.88364909e+01,
##        1.99526231e+01, 2.11348904e+01, 2.23872114e+01, 2.37137371e+01,
##        2.51188643e+01, 2.66072506e+01, 2.81838293e+01, 2.98538262e+01,
##        3.16227766e+01, 3.34965439e+01, 3.54813389e+01, 3.75837404e+01,
##        3.98107171e+01, 4.21696503e+01, 4.46683592e+01, 4.73151259e+01,
##        5.01187234e+01, 5.30884444e+01, 5.62341325e+01, 5.95662144e+01,
##        6.30957344e+01, 6.68343918e+01, 7.07945784e+01, 7.49894209e+01,
##        7.94328235e+01, 8.41395142e+01, 8.91250938e+01, 9.44060876e+01,
##        1.00000000e+02, 1.05925373e+02, 1.12201845e+02, 1.18850223e+02,
##        1.25892541e+02, 1.33352143e+02, 1.41253754e+02, 1.49623566e+02,
##        1.58489319e+02, 1.67880402e+02, 1.77827941e+02, 1.88364909e+02,
##        1.99526231e+02, 2.11348904e+02, 2.23872114e+02, 2.37137371e+02,
##        2.51188643e+02, 2.66072506e+02, 2.81838293e+02, 2.98538262e+02,
##        3.16227766e+02, 3.34965439e+02, 3.54813389e+02, 3.75837404e+02,
##        3.98107171e+02, 4.21696503e+02, 4.46683592e+02, 4.73151259e+02,
##        5.01187234e+02, 5.30884444e+02, 5.62341325e+02, 5.95662144e+02,
##        6.30957344e+02, 6.68343918e+02, 7.07945784e+02, 7.49894209e+02,
##        7.94328235e+02, 8.41395142e+02, 8.91250938e+02, 9.44060876e+02,
##        1.00000000e+03, 1.05925373e+03, 1.12201845e+03, 1.18850223e+03,
##        1.25892541e+03, 1.33352143e+03, 1.41253754e+03, 1.49623566e+03,
##        1.58489319e+03]), <BarContainer object of 128 artists>)
plt.xscale('log')
# plt.xticks([500, 1e3, 2e3, 5e3, 1e4, 2e4], [500, '1k', '2k', '5k', '10k', '20k'])
plt.xlabel('Arrival Delays (mins)')
plt.show()

Bivariate Exploration

I will look at the departure delays and arrival delays. Specifically, I will focus on the arrival delay and its relationship with distance. I want to see if longer flights tend to have longer delays.

# Dropping rows with missing values
print("Original rows and columns =",df_2007.shape)
## Original rows and columns = (2000000, 5)
df_2007_sampled = df_2007.dropna(subset=['ArrDelay', 'Distance']).sample(n=20000, replace = False)
print("Sampled rows and columns =",df_2007_sampled.shape)
## Sampled rows and columns = (20000, 5)
print(df_2007_sampled.info())
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 20000 entries, 841926 to 869387
## Data columns (total 5 columns):
##  #   Column      Non-Null Count  Dtype  
## ---  ------      --------------  -----  
##  0   Month       20000 non-null  int8   
##  1   DayofMonth  20000 non-null  int8   
##  2   ArrDelay    20000 non-null  float32
##  3   DepDelay    20000 non-null  float32
##  4   Distance    20000 non-null  int64  
## dtypes: float32(2), int64(1), int8(2)
## memory usage: 507.8 KB
## None
print(df_2007_sampled.head())
##          Month  DayofMonth  ArrDelay  DepDelay  Distance
## 841926       2          19      50.0      50.0       862
## 1887239      4          20      -1.0       7.0       389
## 1991294      4          16      -8.0       0.0        64
## 507955       1           2      16.0      -6.0      1236
## 371612       1          20      -1.0      13.0      1946
plt.figure(figsize=[12, 12])
plt.scatter(data=df_2007_sampled, x='Distance', y='ArrDelay')
plt.xlabel("Distance in miles")
plt.ylabel("Arrival Delay in minutes")
plt.show()

There seems to be very little correlation from this plot. Since there are a lot of overlapping points, I will apply some transparency to get a better picture.

plt.figure(figsize=[12, 12])
sns.regplot(data=df_2007_sampled, x='Distance', y='ArrDelay', scatter_kws = {'alpha': 1/5}, fit_reg = False);
plt.xlabel("Distance in miles")
plt.ylabel("Arrival Delay in minutes")
plt.show()

Most of the flights had distances between the range of 2500 miles and delays within 400 minutes. Again I will be removing the negative values since I am only interested in flights that arrived later than expected.

df_2007_sampled = df_2007_sampled.query("0 < ArrDelay < 200")

plt.figure(figsize=[12, 12])
sns.regplot(data=df_2007_sampled, x='Distance', y='ArrDelay', scatter_kws = {'alpha': 1/5}, fit_reg = False);
plt.show();

Many of the flights with shorter distances seemed to have shorter delays since the density of the dots seem to fade as it gets away from the 0 mark. Lets try to get a different picture with another plot.

x_bins = np.arange(0, 4500+100, 100)
y_bins = np.arange(0, 200+7, 7)


plt.figure(figsize=[12, 12])
plt.hist2d(data=df_2007_sampled, x='Distance', y='ArrDelay', cmin=0.3, cmap='viridis_r', bins = [x_bins, y_bins]);
plt.show();

Its still the same story from what I can see here.

Day of Week Analysis

Next, I will analyze the delays on a day-of-week basis.

get_mean_delay_year <- function(year) {
    airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
    
    # Query for mean delay times in January (actual delays, no early flights)
    query_jan <- sprintf("SELECT DayOfWeek, DayOfMonth, 
                                 AVG(DepDelay) AS MeanDepDelay, 
                                 AVG(ArrDelay) AS MeanArrDelay
                          FROM table_%s
                          WHERE DepDelay > 0 AND ArrDelay > 0
                          GROUP BY DayOfWeek, DayOfMonth", year)
    
    # tbl(airline_db, sql(query_jan)) # Runs the query and displays results without loading it in memory
    delays_2007 <- dbGetQuery(airline_db, query_jan) # Runs the query and stores it in a dataframe, in memory♂
    dbDisconnect(airline_db) # Disconnect from the database when done
    return(delays_2007)
}
delays_2007 <- get_mean_delay_year('2007')
head(delays_2007)
##   DayOfWeek DayofMonth MeanDepDelay MeanArrDelay
## 1         1          1     35.50271     35.54895
## 2         1          2     40.18925     41.42904
## 3         1          3     38.94795     39.51203
## 4         1          4     46.83710     49.46875
## 5         1          5     36.06113     37.46486
## 6         1          6     42.34605     44.32375
skim(delays_2007)
Data summary
Name delays_2007
Number of rows 216
Number of columns 4
_______________________
Column type frequency:
numeric 4
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
DayOfWeek 0 1 3.99 2.00 1.00 2.00 4.00 6.00 7.00 ▇▃▃▃▇
DayofMonth 0 1 15.93 8.93 1.00 8.00 16.00 24.00 31.00 ▇▇▇▇▇
MeanDepDelay 0 1 38.37 6.48 24.67 33.25 37.68 41.89 61.84 ▃▇▆▁▁
MeanArrDelay 0 1 40.66 7.88 24.93 34.60 39.75 45.29 70.32 ▅▇▅▁▁

For the DayOfWeek data, I will make another column in the dataframe that shows the text representation (Monday, Tuesday …) so that it would be easier to understand in the plot.

def change_column_type(df):
    # Converting R dataframe to Pandas dataframe
    df_delays = pd.DataFrame(df)
    
    # Changing day and month columns from float to integer data types
    df_delays = df_delays.astype({'DayOfWeek':'int8', 'DayofMonth':'int8'})
    
    days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    
    # Creating the new column
    df_delays['Day'] = df_delays['DayOfWeek'].apply(lambda x: days_of_week[x-1])
    
    return(df_delays)

delays_2007 = change_column_type(r.delays_2007)
delays_2007.head()
##    DayOfWeek  DayofMonth  MeanDepDelay  MeanArrDelay     Day
## 0          1           1     35.502709     35.548949  Monday
## 1          1           2     40.189248     41.429041  Monday
## 2          1           3     38.947953     39.512035  Monday
## 3          1           4     46.837098     49.468746  Monday
## 4          1           5     36.061127     37.464863  Monday
plt.figure(figsize=[16,12])
sns.barplot(data=delays_2007, x='Day', y='MeanDepDelay')
plt.title("Average Flight Delay Times In 2007")
plt.xlabel("Day of Week")
plt.ylabel("Average Departure Delay (min)")
plt.show();

From the above chart alone, the average delay times for each weekday seems to be fairly the same. I will see if there is a consistent theme across the years by plotting the chart for 9 consecutive years (1999 - 2008)

I left the colors because, even though the weekday variable is ordinal (i.e. Tuesday comes after Monday and so on), the order doesn’t really matter much in this case because, for example, Friday is not better than Sunday, Monday is not higher than Saturday, etc. The colors will be helpful in identifying each weekday in the subsequent plots.

delays_2000 <- get_mean_delay_year('2000')
delays_2001 <- get_mean_delay_year('2001')
delays_2002 <- get_mean_delay_year('2002')
delays_2003 <- get_mean_delay_year('2003')
delays_2004 <- get_mean_delay_year('2004')
delays_2005 <- get_mean_delay_year('2005')
delays_2006 <- get_mean_delay_year('2006')
delays_2007 <- get_mean_delay_year('2007')
delays_2008 <- get_mean_delay_year('2008')
delays_2000 = change_column_type(r.delays_2000)
delays_2001 = change_column_type(r.delays_2001)
delays_2002 = change_column_type(r.delays_2002)
delays_2003 = change_column_type(r.delays_2003)
delays_2004 = change_column_type(r.delays_2004)
delays_2005 = change_column_type(r.delays_2005)
delays_2006 = change_column_type(r.delays_2006)
delays_2007 = change_column_type(r.delays_2007)
delays_2008 = change_column_type(r.delays_2008)
fig, ax = plt.subplots(ncols = 3, nrows = 3 , figsize = [17,17])

sns.barplot(data=delays_2000, x='Day', y='MeanDepDelay', ax = ax[0, 0])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2001, x='Day', y='MeanDepDelay', ax = ax[1, 0])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2002, x='Day', y='MeanDepDelay', ax = ax[2, 0])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2003, x='Day', y='MeanDepDelay', ax = ax[0, 1])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2004, x='Day', y='MeanDepDelay', ax = ax[1, 1])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2005, x='Day', y='MeanDepDelay', ax = ax[2, 1])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2006, x='Day', y='MeanDepDelay', ax = ax[0, 2])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2007, x='Day', y='MeanDepDelay', ax = ax[1, 2])
# plt.title(f"Average Flight Delay Times In {delay_tables}")
sns.barplot(data=delays_2008, x='Day', y='MeanDepDelay', ax = ax[2, 2])
# plt.title(f"Average Flight Delay Times In {delay_tables}")

plt.show();

From an overview of the chart above, there is no consistent trend to predict which weekdays have more delays. This is expected because there are many other factors to consider like the month, the season, holidays, airport carrier, plane age and global events.

Cancelled and Diverted

Now I will examine the cancelled and diverted flights, relative to the carriers (airline companies). I want to see if flights from a carrier tend to get get cancelled or diverted more than others.

Below is an SQL query to get all flights that were either diverted or cancelled and group them by the flight carrier. On the original table, there is a column for diverted (1 or 0) and another column for cancelled (1 or 0). I believe the data is not completely tidy because a flight that is cancelled cannot be diverted and vice-versa. So I combined them to a single column indicating whether the flight was cancelled or diverted.

get_changed_flights <- function(year) {
    airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db
    query <- sprintf("SELECT 
                        UniqueCarrier,
                        Description AS Carrier,
                        CASE 
                            WHEN Diverted = 1 AND Cancelled = 0 THEN 'Diverted'
                            WHEN Diverted = 0 AND Cancelled = 1 THEN 'Cancelled'
                        END AS FlightStatus,
                        COUNT(*) AS Flights
                      FROM table_%s
                      LEFT JOIN carriers
                      ON table_%s.UniqueCarrier = carriers.Code
                      WHERE Diverted = 1 OR Cancelled = 1
                      GROUP BY UniqueCarrier, Carrier, FlightStatus
                      ORDER BY UniqueCarrier", year, year)
    
    flights_changed <- dbGetQuery(airline_db, query) # Runs the query and stores it in a dataframe, in memory♂
    dbDisconnect(airline_db) # Disconnect from the database when done
    return(flights_changed)
}

flights_changed_2007 <- get_changed_flights('2007')
head(flights_changed_2007, 10)
##    UniqueCarrier                Carrier FlightStatus Flights
## 1             9E Pinnacle Airlines Inc.    Cancelled    7939
## 2             9E Pinnacle Airlines Inc.     Diverted     726
## 3             AA American Airlines Inc.    Cancelled   17924
## 4             AA American Airlines Inc.     Diverted    2097
## 5             AQ    Aloha Airlines Inc.    Cancelled     388
## 6             AQ    Aloha Airlines Inc.     Diverted      15
## 7             AS   Alaska Airlines Inc.    Cancelled    2563
## 8             AS   Alaska Airlines Inc.     Diverted     499
## 9             B6        JetBlue Airways    Cancelled    3710
## 10            B6        JetBlue Airways     Diverted     624

Lets see the number of cancelled and diverted flights in this data (the 2006 data).

flights_changed_2007 = pd.DataFrame(r.flights_changed_2007)

flights_changed_2007.head()
##   UniqueCarrier                 Carrier FlightStatus  Flights
## 0            9E  Pinnacle Airlines Inc.    Cancelled     7939
## 1            9E  Pinnacle Airlines Inc.     Diverted      726
## 2            AA  American Airlines Inc.    Cancelled    17924
## 3            AA  American Airlines Inc.     Diverted     2097
## 4            AQ     Aloha Airlines Inc.    Cancelled      388
print(f"Total number of diverted or cancelled flights in 2007: {flights_changed_2007.Flights.sum()}")
## Total number of diverted or cancelled flights in 2007: 177927
plt.figure(figsize=[16,12])
sns.barplot(data = flights_changed_2007, x = 'UniqueCarrier', y = 'Flights', hue = 'FlightStatus')
plt.title("Cancelled & Diverted Flights From Each Carrier In 2007")
plt.show()

We can already see that MQ had the highest number of cancelled flights (by a relatively wide margin). To better understand the plot, I will make it horizontal and order it by number of cancelled flights.

#Sort by descending order of number of flights
flights_changed_2007 = flights_changed_2007.sort_values(by='Flights', ascending=False)

plt.figure(figsize=[20,14])

sns.barplot(data = flights_changed_2007, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("Cancelled & Diverted Flights From Each Carrier In 2007")
plt.show()

Tts easier to see that the airline companies with the most number of cancelled flight is E and diverted flights are A B C. So we can at least judge that the carriers at the bottom of the chart have a good record of flight data (though I am aware that the chart could be like that because those companies at the bottom do not have as many flights as those at the top).

Now I’m going to plot the chart for the most recent 6 years from the dataset, to see if this trend is the same across the years, i.e, to see if the same companies are always on top.

flights_changed_2003 = get_changed_flights('2003') 
flights_changed_2004 = get_changed_flights('2004') 
flights_changed_2005 = get_changed_flights('2005') 
flights_changed_2006 = get_changed_flights('2006') 
flights_changed_2007 = get_changed_flights('2007') 
flights_changed_2008 = get_changed_flights('2008') 
flights_changed_2003 =pd.DataFrame(r.flights_changed_2003).sort_values(by='Flights', ascending=False)
flights_changed_2004 =pd.DataFrame(r.flights_changed_2004).sort_values(by='Flights', ascending=False)
flights_changed_2005 =pd.DataFrame(r.flights_changed_2005).sort_values(by='Flights', ascending=False)
flights_changed_2006 =pd.DataFrame(r.flights_changed_2006).sort_values(by='Flights', ascending=False)
flights_changed_2007 =pd.DataFrame(r.flights_changed_2007).sort_values(by='Flights', ascending=False)
flights_changed_2008 =pd.DataFrame(r.flights_changed_2008).sort_values(by='Flights', ascending=False)
plt.figure(figsize=[16,12])
base_color = sns.color_palette()[2]

plt.suptitle("2003 to 2006")

plt.subplot(3, 2, 1)
sns.barplot(data = flights_changed_2003, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2003")
plt.subplot(3, 2, 2)
sns.barplot(data = flights_changed_2004, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2004")
plt.subplot(3, 2, 3)
sns.barplot(data = flights_changed_2005, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2005")
plt.subplot(3, 2, 4)
sns.barplot(data = flights_changed_2006, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2006")
plt.subplot(3, 2, 5)
sns.barplot(data = flights_changed_2007, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2007")
plt.subplot(3, 2, 6)
sns.barplot(data = flights_changed_2008, y = 'Carrier', x = 'Flights', hue = 'FlightStatus')
plt.title("2008")

# Hiding the axis ticks and tick labels of the bigger plot
plt.tick_params(labelcolor="none", bottom=False, left=False)

# Adding the x-axis and y-axis labels for the bigger plot
plt.xlabel('Common X-Axis', fontsize=15, fontweight='bold')
plt.ylabel('Common Y-Axis', fontsize=15, fontweight='bold')

plt.show()

We can see that the same companies are at the top

Next I want to analyze the number of delays from a general perspective. With the level of advancements in technology, I expect to see a relative reduction in the percentage of delayed flights each year.

The following query takes over 20 minutes to run so I saved the data in a CSV file after running it, so that I don’t have to rerun it unless I change something.

get_all_flights <- function() {
    airline_db <- dbConnect(SQLite(), db_file) # Making a connection to db

    s <- "Year, 
          CASE 
              WHEN Diverted = 1 AND Cancelled = 0 THEN 'Diverted'
              WHEN Diverted = 0 AND Cancelled = 1 THEN 'Cancelled'
              ELSE 'Arrived'
          END AS FlightStatus,
          COUNT(*) AS Flights"
    
    query <- sprintf("
        SELECT %s FROM table_1987 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1988 GROUP BY Year, FlightStatus
        UNION ALL 
        SELECT %s FROM table_1989 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1990 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_1991 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1992 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_1993 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1994 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_1995 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1996 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_1997 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_1998 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_1999 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_2000 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_2001 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_2002 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_2003 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_2004 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_2005 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_2006 GROUP BY Year, FlightStatus
        UNION ALL
        SELECT %s FROM table_2007 GROUP BY Year, FlightStatus 
        UNION ALL 
        SELECT %s FROM table_2008 GROUP BY Year, FlightStatus", 
        s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s)
    
    all_flights <- dbGetQuery(airline_db, query) # Runs the query and stores it in a dataframe, in memory♂
    dbDisconnect(airline_db) # Disconnect from the database when done
    return(all_flights)
}
# all_flights <- get_all_flights()
# write.csv(all_flights,"all_flights.csv", row.names = FALSE)
all_flights <- read.csv('all_flights.csv')
all_flights
##    Year FlightStatus Flights
## 1  1987      Arrived 1288326
## 2  1987    Cancelled   19685
## 3  1987     Diverted    3815
## 4  1988      Arrived 5137497
## 5  1988    Cancelled   50163
## 6  1988     Diverted   14436
## 7  1989      Arrived 4952196
## 8  1989    Cancelled   74165
## 9  1989     Diverted   14839
## 10 1990      Arrived 5202481
## 11 1990    Cancelled   52458
## 12 1990     Diverted   15954
## 13 1991      Arrived 5020835
## 14 1991    Cancelled   43505
## 15 1991     Diverted   12585
## 16 1992      Arrived 5027937
## 17 1992    Cancelled   52836
## 18 1992     Diverted   11384
## 19 1993      Arrived 5000323
## 20 1993    Cancelled   59845
## 21 1993     Diverted   10333
## 22 1994      Arrived 5101202
## 23 1994    Cancelled   66740
## 24 1994     Diverted   12106
## 25 1995      Arrived 5225038
## 26 1995    Cancelled   91905
## 27 1995     Diverted   10492
## 28 1996      Arrived 5209326
## 29 1996    Cancelled  128536
## 30 1996     Diverted   14121
## 31 1997      Arrived 5301999
## 32 1997    Cancelled   97763
## 33 1997     Diverted   12081
## 34 1998      Arrived 5227051
## 35 1998    Cancelled  144509
## 36 1998     Diverted   13161
## 37 1999      Arrived 5360018
## 38 1999    Cancelled  154311
## 39 1999     Diverted   13555
## 40 2000      Arrived 5481303
## 41 2000    Cancelled  187490
## 42 2000     Diverted   14254
## 43 2001      Arrived 5723673
## 44 2001    Cancelled  231198
## 45 2001     Diverted   12909
## 46 2002      Arrived 5197860
## 47 2002    Cancelled   65143
## 48 2002     Diverted    8356
## 49 2003      Arrived 6375690
## 50 2003    Cancelled  101469
## 51 2003     Diverted   11381
## 52 2004      Arrived 6987729
## 53 2004    Cancelled  127757
## 54 2004     Diverted   13784
## 55 2005      Arrived 6992838
## 56 2005    Cancelled  133730
## 57 2005     Diverted   14028
## 58 2006      Arrived 7003802
## 59 2006    Cancelled  121934
## 60 2006     Diverted   16186
## 61 2007      Arrived 7275288
## 62 2007    Cancelled  160748
## 63 2007     Diverted   17179
## 64 2008      Arrived 2319121
## 65 2008    Cancelled   64442
## 66 2008     Diverted    5654
# Only flights that did not arrive
not_arrived = pd.DataFrame(r.all_flights).query("FlightStatus != 'Arrived'")
# not_arrived

plt.figure(figsize=[16,12])
base_color = sns.color_palette()[2]

sns.lineplot(data=not_arrived, x='Year', y='Flights', hue='FlightStatus')
plt.show()